Abstract

This  report  refutes  claims  that  adaptive  routing  performs  better  than 
dimension-order  routing.  Simulation  results  are  presented  that  show 
dimension-order  routing  achieves  both  higher  throughput  and  lower 
latency  than  adaptive  routing.  Specious  claims  for  the  advantages  of 
adaptive  routing  are  critiqued. 


1  Introduction 

Adaptive  routing  [1]  is  an  alluring  idea.  Although  there  are  usually 
many  minimum-distance  paths  between  two  nodes  in  a  multicomputer 
message-passing  network,  the  dimension-order  routing  used  in  existing 
networks  [2]  always  routes  a  packet  from  a  given  source  to  a  given 
destination  along  the  same  path.  Adaptive  routing  allows  a  packet 
to  follow  any  minimal  path  from  source  to  destination,  and  would 
seem  to  offer  an  opportunity  to  decrease  latency,  diffuse  local  areas 
of  congestion,  increase  channel  utilization  (throughput),  and  improve 
fault  tolerance. 

The  most  attractive  feature  of  adaptive  routing  was  its  promise  to 
double  network  throughput.  Previous  simulation  studies  [1]  indicated 
that networks using dimension-order routing could utilize only ≈ 50%
of their bisection bandwidth, whereas adaptive-routing networks could
utilize ≈ 90% of their bisection bandwidth. Improving throughput
was  the  primary  motivation  for  adaptive  routing  because  throughput 
is  more  important  than  latency,  and  claims  for  traffic  diffusion  and 
fault  tolerance  had  not  been  demonstrated. 

A VLSI implementation of adaptive routing was undertaken [3]
with the belief that adaptive routing would increase throughput,
diffuse hot-spots, and improve fault tolerance. In preparation for
making a detailed design, an architecture [4] and simulators [5] were
developed to reproduce, refine, and extend the earlier studies. The
architecture was sufficiently general that it could implement either
dimension-order routing (DOR) or adaptive routing (AR). The simulators
were significantly faster and more flexible than simulators used in
earlier studies, and a wider range of network topologies, sizes, and
characteristics were simulated. In particular, it was possible to
compare DOR and AR for realistic network sizes [6] while keeping
everything but the routing algorithm fixed.

Throughput:  It  was  discovered  that  the  earlier  results  showing  a 
performance  advantage  for  AR  over  DOR  were  an  artifact  of  giving 
the  adaptive  routers  more  buffering  than  the  dimension-order  routers: 

Fallacy 1 Dimension-order routers can utilize only ≈ 50%
of the bisection bandwidth, whereas adaptive routers can
utilize ≈ 90% of the bisection bandwidth.

Fact  1  Dimension-order  routing  allows  optimal  bandwidth 
utilization.  As  the  network  radix  increases,  the  bisection 
utilization  of  DOR  approaches  100%.  When  given  equal 
buffering,  DOR  can  support  higher  throughput  than  AR. 

Latency: The intuitive notion that AR can reduce latency by
circumnavigating local congestion is also incorrect. Indeed, blocking
due to output competition is very rare with DOR, especially for
networks with large radices. With DOR, an output channel is used almost
exclusively by the corresponding input channel in the same dimension,
except for injections from the previous dimensions, which occur with
probability O(1/R) for network radix R. Since contention is rare, AR
does not perform better than DOR. Moreover, DOR performs better than AR
for heavy applied loads. DOR utilizes all bisection channels equally,
and achieves the maximum possible throughput. AR does not utilize
all the mesh's bisection channels evenly: a packet is allowed to follow
any minimal path from source to destination, and many more paths
cross the middle of the bisection than cross the edges of the
bisection. AR creates a surfeit of congestion in the center of the mesh
and under-utilizes the edges; this prevents AR from achieving maximal
throughput, and leads to much higher latency than DOR, for heavy
traffic. In summary:

Fallacy  2  Adaptive  routing  decreases  latency  by  routing 
packets  around  congestion. 

Fact 2 Adaptive routing increases latency for heavy traffic.
Output contention is rare, O(1/R), for dimension-order
routing. Dimension-order routing produces less network
congestion under heavy traffic, and achieves higher
throughput.

“Hot Spots”: The notion that adaptive routing would improve
performance by avoiding “hot spots” is specious. Claims about hot
spots appeal to intuition, but they have not been accompanied by a
precise, realistic definition of “hot spot.” If a hot spot is a random,
local fluctuation in traffic, then the simulations of random traffic
presented in this report show that DOR, not AR, gives better
performance. If a hot spot is a chronic region of abnormally heavy
traffic, then hot spots are pathological cases. Chronic regions of
congestion can be attributed to poor program design or poor process
placement, and networks are not designed to remedy the ills of a
particular program. For a specific definition of “hot spot,” the burden
of proof lies with the AR proponent, who must show that such hot spots
occur in real networks, that they degrade network performance, that
they are best handled in the network itself, and that AR improves
their handling significantly. One could almost certainly concoct a
pathological traffic pattern that favors AR, but the design of
general-purpose routing networks cannot be based upon a special-case
traffic pattern; random traffic is the most general model.
Communication patterns that exhibit locality cannot benefit from AR
because there is negligible path multiplicity for short paths.
Communication patterns that do not exhibit locality can use random
process placement to avoid pathological congestion, and DOR outperforms
AR for random traffic.

Fallacy 3 Adaptive routing improves performance by
diffusing “hot spots.”

Fact 3 This nebulous claim has never been substantiated.
Localized traffic cannot benefit from AR because it lacks
path multiplicity. Random traffic is the accepted worst-case
model for non-localized traffic, and DOR outperforms AR
for random traffic.


Fault Tolerance: Fault tolerance is a popular concept, but the
term is often used loosely. Since AR allows a packet to choose from
more than one path through the network, “fault tolerance” is
sometimes listed among the virtues of AR. When actual studies of the
fault tolerance of AR have been done [1], the results for realistic
topologies [6] have been poor. AR does not provide redundancy for all
paths; this fact alone is sufficient to discredit claims that AR
provides fault tolerance.


Note 1 In a radix-R d-dimensional mesh, the number of
minimum-distance paths between two nodes separated by
$(\Delta x_1, \ldots, \Delta x_d)$ is

\[ \frac{(\Delta x_1 + \cdots + \Delta x_d)!}{(\Delta x_1)! \cdots (\Delta x_d)!} \]


Note 2 There are $R^d\,(d(R-1)+1)$ node pairs that have
only one minimum-distance path between them. Even when
multiple paths exist, they may overlap significantly and are
not independent.
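
The two notes above can be checked numerically. The following is a minimal sketch, not taken from the report: the function names `min_paths` and `single_path_pairs` are hypothetical, and the count is assumed to cover ordered (src, dst) pairs with src = dst included, which is what the expression in Note 2 yields.

```c
#include <assert.h>

/* Multinomial coefficient (|dx[0]|+...+|dx[d-1]|)! / (|dx[0]|! ... |dx[d-1]|!):
   the number of minimum-distance paths for per-dimension offsets dx[]. */
unsigned long long min_paths(const int *dx, int d)
{
    unsigned long long paths = 1;
    int placed = 0;                       /* hops accounted for so far */
    assert(d > 0);
    for (int i = 0; i < d; i++) {
        int k = dx[i] < 0 ? -dx[i] : dx[i];
        /* multiply by C(placed + k, k) one factor at a time;
           every intermediate division is exact */
        for (int j = 1; j <= k; j++)
            paths = paths * (unsigned long long)(placed + j)
                          / (unsigned long long)j;
        placed += k;
    }
    return paths;
}

/* Brute-force count of ordered (src, dst) pairs, src == dst included,
   that have exactly one minimum-distance path in a radix-R,
   d-dimensional mesh. Nodes are numbered as base-R digit strings. */
unsigned long long single_path_pairs(int R, int d)
{
    unsigned long long nodes = 1, count = 0;
    unsigned long long radix = (unsigned long long)R;
    int dx[8];
    assert(R > 0 && d > 0 && d <= 8);
    for (int i = 0; i < d; i++)
        nodes *= radix;
    for (unsigned long long s = 0; s < nodes; s++)
        for (unsigned long long t = 0; t < nodes; t++) {
            unsigned long long a = s, b = t;
            for (int i = 0; i < d; i++) {
                dx[i] = (int)(a % radix) - (int)(b % radix);
                a /= radix;
                b /= radix;
            }
            if (min_paths(dx, d) == 1)
                count++;
        }
    return count;
}
```

For a 4 × 4 mesh the brute-force count agrees with Note 2: 16 (2·3 + 1) = 112 pairs have a single minimal path, out of 256 ordered pairs.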

Building a fault-tolerant network requires an intentional design effort,
and well-defined reliability goals. The fact that a routing algorithm
incidentally yields path redundancy for some (src, dst) pairs does not
justify claims of fault tolerance.

Indeed, some legitimate approaches to fault tolerance¹ are easier
to layer atop a network that uses DOR. With AR, any one of many
paths might be taken by a packet, so delivery can be guaranteed only if
all such paths are fault free. With DOR and static faults, a (src, dst)
pair always works or always fails.

Fallacy  4  Adaptive  routing  provides  fault  tolerance. 

Fact 4 Adaptive routing does not provide redundancy for
all paths, and it does not provide practical fault tolerance.
The possibility that a packet might follow any one of many
different paths makes it harder to guarantee that a packet
will be delivered in a faulty network using AR.

¹ A faulty (src, dst) path can be rerouted through an intermediate
node: (src, i), (i, dst). A fixed intermediate can be chosen for each
broken path when static faults are recorded. Such an approach requires
no special hardware and can be used in existing machines.


2  Simulation  Results 

This section presents measurements of network latency obtained from
a simulator [5]. The simulation parameters are (d, R, L, A), where d is
the mesh dimension, R is the mesh radix, L is the packet length in
flits,² and A is the applied load. The number of nodes is $N = R^d$, and
the bisection bandwidth is $B = R^{d-1}$. The simulator uses random,
homogeneous traffic: on every cycle each node generates a packet with
probability $q = 4A/(RL)$, and all (src, dst) pairs are equiprobable.

\[ q = \frac{4A}{RL}\ \frac{\text{packets}}{\text{cycle}}
     = \frac{4A}{R}\ \frac{\text{flits}}{\text{cycle}}
     = \frac{4AW}{R}\ \frac{\text{bits}}{\text{cycle}} \]

The applied load (A) is expressed as a fraction of the network's
bisection bandwidth (B); in steady state, the bisection utilization (U)
is equal to the applied load (A), and the throughput is T = 4BU.³
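
The injection probability can be recovered by balancing flits at the bisection. The derivation below is a sketch under the report's stated assumptions (random traffic, one quarter of all packets crossing the bisection in each direction, and a bisection of $R^{d-1}$ channels per direction); the numeric values at the end are illustrative only.

```latex
% Flits per cycle crossing the bisection in one direction
% must equal the applied load times the bisection bandwidth:
\[
  \underbrace{R^{d}}_{\text{nodes}} \cdot q \cdot L \cdot \tfrac{1}{4}
  \;=\; A \cdot R^{d-1}
  \quad\Longrightarrow\quad
  q \;=\; \frac{4A}{RL}.
\]
% Illustrative values: R = 128, L = 32, A = 0.5
\[
  q \;=\; \frac{4 \cdot 0.5}{128 \cdot 32} \;=\; \frac{1}{2048}
  \;\approx\; 4.9 \times 10^{-4}\ \text{packets per node per cycle}.
\]
```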

The  steady-state  average  cut-through  latency  T  is  measured  for  a 
range  of  applied  loads:  A  =  10%,  30%,  50%,  70%.  Measurements  are 
made  for  several  mesh  radices  R  and  dimensions  d.  The  measured 
latency  T  is  the  average  time  between  the  sending  of  a  packet  from 
the  source  and  its  arrival  at  the  destination,  including  injection  and 
cut-through  latency  but  not  spooling  latency.  Cut-through  latency  is 
the  head-to-head  transmission  delay.  Since  the  injection  queue  is  just 
another  input  FIFO,  injection  latency  and  cut-through  latency  are  not 
separated.  Spooling  latency  refers  to  the  L-cycle  head-to-tail  delay. 

² One flit is W bits, where W is the channel width.

³ For random traffic, 1/2 of all packets cross the bisection, 1/4 in
each direction.

The basic simulator has been described in a previous report [5], so
only minimal discussion is included here. Results are reported for two
simulator variants: SNS (which uses DOR) and ANS (which uses AR).
The versions of SNS and ANS used to produce the data for this report
differ from those previously reported [5] in some minor respects. The
main() function has been changed to wait for the throughput to
converge before waiting for the latency to converge. Also, a
measurement of the average queue length (AQLEN) has been added. A
listing of the SNS code used for this report is included as an
appendix. ANS differs from SNS only in the following function:

/* Return nonzero if the packet at the head of input `in` of node `n`
   may be forwarded on output `out` during the current cycle. */
int allowed(n,in,out) int n,in,out; {
  packet *p=node[n].head[in]; int pc,nc,dim=DIMOF(out);
  if(!p || p->tin>curtime) return 0;         /* p arrived? */
  if(node[n].tnh[in]>curtime) return 0;      /* p at head? */
  if(node[n].tfree[out]>curtime) return 0;   /* out free? */
  if(p->dest == n) return out==0;            /* p at dest? */
  pc=COORD(p->dest,dim), nc=COORD(n,dim);
  if(nc<pc) return out==SUCC(dim);
  if(nc>pc) return out==PRED(dim);
  return 0;                                  /* not profitable */
}
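
To make the difference between the two routing rules concrete, here is a self-contained sketch of the "profitable output" test in isolation. It is not the SNS or ANS code: the helper `coord`, the functions `ar_profitable` and `dor_profitable`, the base-R node numbering, and the choice of routing the lowest dimension first are all assumptions made for illustration.

```c
#include <assert.h>

/* Coordinate of node n in dimension dim of a radix-R mesh
   (assumed numbering: node numbers are base-R digit strings). */
static int coord(int n, int dim, int R)
{
    while (dim-- > 0)
        n /= R;
    return n % R;
}

/* AR: a hop in any dimension is allowed if it reduces the offset in
   that dimension. dir is +1 (successor) or -1 (predecessor). */
int ar_profitable(int n, int dest, int dim, int dir, int R)
{
    int nc = coord(n, dim, R), pc = coord(dest, dim, R);
    assert(dir == 1 || dir == -1);
    if (nc < pc) return dir == 1;
    if (nc > pc) return dir == -1;
    return 0;                         /* offset is zero: not profitable */
}

/* DOR (assumed order: lowest dimension first): a hop is allowed only
   in the first dimension whose offset is still nonzero. */
int dor_profitable(int n, int dest, int dim, int dir, int R)
{
    for (int lower = 0; lower < dim; lower++)
        if (coord(n, lower, R) != coord(dest, lower, R))
            return 0;                 /* an earlier dimension is unfinished */
    return ar_profitable(n, dest, dim, dir, R);
}
```

In a 4 × 4 mesh, a packet at node 0 destined for node 5 has two profitable outputs under the AR rule (one per dimension) but only one under the DOR rule, which matches the statement that AR allows up to d outputs while DOR allows exactly one.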

Except  where  otherwise  noted,  the  simulations  use  packet  length 
L=32  flits,  which  is  realistic  for  fine-grained  computations  [7,  8].  For 
comparison,  L=8  and  L=128  results  are  presented  in  an  appendix. 

The simulations were run using ACCURACY=.03, which is more
than sufficient to show that DOR is preferable to AR. The execution
times for the simulations would increase by an order of magnitude if
ACCURACY=.01 were used instead.

The simulator measures the network-average queue length at
termination, not the time-average queue length: AQLEN is the number of
packets that have been sent but not received divided by the number
of queues. The AQLEN values are listed to provide insight, but the
simulator does not check their convergence. Inspecting the AQLEN values
shows that the average queue length is typically only a fraction of a
packet. It can also be seen that the average queue length is greater
with AR than with DOR for heavy traffic. The AQLEN values are
coarse and may appear noisy.


2.1  One-Dimensional  Results 

There is no difference between AR and DOR for 1D because there is
only one path between any two nodes. ANS and SNS produce identical
results for d=1. Although 1D results cannot be used to compare
AR and DOR, they are included for completeness; 1D is the simplest
case and will be studied before proceeding to 2D and 3D. Some 1D
simulation results are shown in Table 1.

Observation 1 Average cut-through latency is not directly
proportional to network radix, so it is not directly
proportional to average distance.

            A=.1            A=.3            A=.5            A=.7
   R      T    AQLEN      T    AQLEN      T    AQLEN      T    AQLEN
   4    3.95   0.00     10.8   0.00     34.8   0.46     335    3.18
   8    5.47   0.00     11.5   0.04     26.4   0.13     73.6   0.65
  16    8.37   0.00     14.9   0.00     29.7   0.06     71.5   0.11
  32    13.8   0.01     20.6   0.00     36.1   0.04     72.2   0.04
  64    24.7   0.00     31.9   0.00     45.8   0.02     80.6   0.01
 128    46.1   0.00     53.7   0.00     67.0   0.00     103    0.02

Table 1: 1D Data


Observation 2 For a given applied load there is an “optimal”
radix for which the cut-through latency is minimal.
The optimal radix increases with throughput.

When a packet traverses an empty network, its cut-through latency
is T_cut = D + 1: there is a one-cycle delay for each hop along a path
of distance D, and a one-cycle delay at the destination. The average
distance in a radix-R d-dimensional mesh is
$\bar{D} = \frac{d}{3}\left(R - \frac{1}{R}\right)$ [10].
Congestion in the network increases the cut-through latency. Even for
light traffic (A = 10%) congestion cannot be neglected; the
congestion-free latency formula is correct only when the applied load
is extremely small (a few percent of network capacity). The
congestion-free latency formula is not even qualitatively correct:
according to the formula, average cut-through latency should increase
linearly with average distance, but latency is not proportional to
distance. Latency grows only linearly with average distance (radix),
but it grows superlinearly with channel utilization (throughput).
According to the Pollaczek-Khinchin formula, average queue delay grows
with queue utilization (u) as u/(1 − u). As radix increases, blocking
becomes rarer, so queue utilization decreases.
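
The average distance can be checked against a direct enumeration. The sketch below assumes the closed form (d/3)(R − 1/R), which reproduces the average distances quoted later in the report (3.75 for a 4 × 4 × 4 mesh, 5.25 for an 8 × 8 mesh), and averages over all ordered (src, dst) pairs with src = dst included. The function names are hypothetical.

```c
#include <assert.h>

/* Closed form assumed for the average distance in a radix-R,
   d-dimensional mesh: (d/3)(R - 1/R). */
double avg_distance_formula(int R, int d)
{
    assert(R > 0 && d > 0);
    return (double)d / 3.0 * ((double)R - 1.0 / (double)R);
}

/* Brute force: mean of |i - j| over all R*R coordinate pairs in one
   dimension, times d, since each dimension contributes independently. */
double avg_distance_brute(int R, int d)
{
    long sum = 0;
    assert(R > 0 && d > 0);
    for (int i = 0; i < R; i++)
        for (int j = 0; j < R; j++)
            sum += i > j ? i - j : j - i;
    return (double)d * (double)sum / ((double)R * (double)R);
}
```

With these definitions the empty-network latency D + 1 for an 8 × 8 mesh is 6.25 cycles on average, which gives a baseline for reading the measured latencies.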

Note 3 With DOR, an incoming packet continues in the
same dimension unless its offset in that dimension is zero.
If the offset in the current dimension has been reduced to
zero, the packet is injected into the next dimension. A
packet continuing in its present dimension is blocked by
output competition only if the output is being used by an
injection from the previous dimension. The average
injection rate is $q = 4A/(RL)$, so the probability of blocking is
O(1/R).

As  radix  increases,  the  average  distance  increases  but  the  average 
delay  per  hop  decreases.  For  a  given  applied  load,  there  is  a  latency- 
optimal  radix  that  gives  the  smallest  cut-through  latency.  The  radix 
that  gives  the  smallest  latency  increases  with  applied  load  as  queueing 
delay  becomes  more  important. 

To  give  a  better  sampling  of  the  latency  versus  applied  load  curve, 
more  data  is  presented  in  Table  2. 


   R   A=.55  A=.6   A=.65  A=.75  A=.8   A=.85  A=.9   A=.95
   4   47.1   73.0   123    ∞      ∞      ∞      ∞      ∞
   8   33.2   44.1   55.4   112    193    824    ∞      ∞
  16   35.0   44.1   52.9   --     140    205    1233   ∞
  32   40.3   48.0   57.4   91.5   126    180    353    ∞
  64   51.7   58.5   68.6   97.4   125    182    284    833
 128   72.4   81.3   89.9   121    144    190    295    611

(-- : entry illegible in the source)

Table 2: More 1D Latencies


Observation 3 Latency is a smooth, super-linear,
monotonically increasing function of throughput (A) that diverges
for applied loads above a radix-dependent maximum
bisection utilization Umax.

Observation  4  Umax  increases  with  radix. 

The existence of a maximum bisection utilization less than 100% can
be readily deduced for random traffic. Consider the bisection channel
and the node that feeds that channel. The channel can be used
either by a new injection at that node or by a packet continuing from
the node's predecessor. When the bisection channel is not used for
either injection or forwarding, it is idle; thus, the bisection
utilization is limited to less than 100%. The probability that the node
injects a packet that uses the bisection is q/2, and
A < 1 ⟹ q < 4/R.⁴ The probability that a packet continues from the
predecessor across the bisection channel is $\frac{1}{1+(2/R)}\,u$,
where u is the utilization of the channel from the predecessor. Thus,
the probability that the bisection channel is idle is

\[ P_{idle} > \left(1 - \frac{2}{R}\right)\left(1 - \frac{1}{1 + (2/R)}\right) \]

and $U_b = 1 - P_{idle}$ is bounded by

\[ U_b < 1 - \frac{2\,(R-2)}{R\,(R+2)} \]

⁴ Unless R > 4, this bound is vacuous.
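
The bound is easy to tabulate. The sketch below assumes the reconstructed form U_b < 1 − 2(R − 2)/(R(R + 2)), obtained by multiplying out the two idle-probability factors (1 − 2/R) and (1 − 1/(1 + 2/R)); the function name `umax_bound` is hypothetical.

```c
#include <assert.h>

/* Upper bound on bisection utilization for random traffic,
   assuming U_b < 1 - 2(R - 2) / (R (R + 2)). */
double umax_bound(int R)
{
    assert(R > 2);
    return 1.0 - 2.0 * (double)(R - 2) / ((double)R * (double)(R + 2));
}
```

For R = 8 the expression evaluates to 1 − 12/80 = 0.85, and it moves toward 1 as R grows through the simulated radices (8, 16, ..., 128).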


2.2  Two-Dimensional  Results 

Although data comparing AR to DOR for 1D and 3D meshes is also
presented, the comparison for 2D meshes is the most realistic,
practical, and interesting. This research is part of a large
multicomputer design project called the Mosaic [6]. The Mosaic is a
128 × 128 mesh of single-chip multicomputer nodes; each node contains
an 11 MIPS processor, 64 KB memory, and an asynchronous DOR network
router with 60 MB/s channels. Since the 2D mesh is the topology used in
state-of-the-art machines [6, 11, 12], it is the most appropriate
topology for comparing AR to DOR.⁵

The 2D simulation results are shown in Table 3. T and AQLEN are
listed as “∞” when the latency and queue lengths did not converge,
i.e., when they grew without bound.

Observation  5  DOR  can  support  higher  throughput  than 
AR.  Umax  is  a  decreasing  function  of  dimension  for  AR. 

Observation  6  DOR  yields  much  lower  latency  than  AR 
for  heavy  traffic.  AR  yields  slightly  lower  latency  for  light 
traffic. 

Observation 7 The performance advantage of DOR
increases with radix. DOR begins beating AR at lower applied
loads for larger radices. DOR beats AR at any applied
load for a 128 × 128 mesh.⁶

Observation  8  Average  queue  length  is  typically  only  a 
fraction  of  a  packet,  and  queue  lengths  decrease  as  radix 
increases. 

Observation  9  Differences  in  latency  between  DOR  and 
AR  reflect  differences  in  average  queue  length. 

⁵ For planar wiring, 2D networks are throughput optimal [13].

⁶ In an empty network, $T_{cut} = D + 1 \Rightarrow T = \bar{D} + 1$,
for both AR and DOR.

Table 3: 2D Data

Even when AR latency is less than DOR latency, the difference
is far too small to justify the greater cost (e.g., area and complexity)
of AR. Moreover, it is precisely when network performance matters
most (i.e., at high applied loads) that DOR performs significantly
better than AR. The extreme difference in latency between AR and
DOR for heavy applied loads can be attributed to the difference in
Umax.

The approximate formula for Umax that was derived for 1D is still
applicable for 2D with DOR, since 2D DOR can be regarded as two
independent 1D routings. The Umax bound is a property of random
traffic, not a property of the routing algorithm. DOR can support
maximal throughput.

AR does not support maximal throughput. AR has a lower Umax
than DOR. The lower throughput of AR is a property of the routing
algorithm. DOR achieves maximal throughput because every channel
of the bisection is utilized maximally. AR cannot support maximal
throughput because it does not utilize all bisection channels evenly.
With AR, channels at the center of the bisection are utilized more
than channels at the edges of the bisection because more of a packet's
possible paths pass through the center of the mesh. If the utilizations
of the bisection channels are independently monitored, then Umax is
seen to be the DOR value multiplied by the ratio of the mean to the
peak of the AR bisection-utilization profile [14]. Note that, even if an
AR implementation could be devised that performs as well as DOR,
AR cannot beat DOR.

For a given number of nodes and a given bisection bandwidth,
the highest throughput is achieved by the network of the largest radix
(smallest dimension) because Umax increases with R. If bisection
bandwidth were a realistic measure of network cost, this would imply
that 1D networks were optimal. However, in reality, a 1D network is much
more expensive than a 2D network with the same bisection bandwidth.
Wiring area is a much more realistic cost metric than bisection
bandwidth, and for a given wiring area, the highest bisection bandwidth
is achieved by using a 2D network [13]. Therefore, large-N machines
(e.g., N=16384) have large radices (e.g., R=128), and for large radices
DOR outperforms AR a fortiori.

As noted earlier, the simulators used for this report give only a
crude measure of the average queue length. AQLEN is the
network-average queue length when the simulator terminated. A proper
measurement of the average queue length would be the time average of
the network average. If one were interested in measuring the average
queue length properly, one could modify the simulator to record
AQLEN every cycle; the time average of AQLEN could be monitored for
convergence the way T is monitored. Given that AQLEN is a crude
measurement, not much should be deduced from the tabulated values.
However, it seems safe to conclude from the above data that
fairly short FIFOs are sufficient for real router implementations.
Existing routers use FIFO lengths that are only a fraction of the
average packet length,⁷ and the simulation results suggest that little
performance improvement would result from using longer FIFOs.⁸

2.3  Three-Dimensional  Results 

Using higher-dimensional meshes should favor AR. As the number of
dimensions increases, so does the path multiplicity. With AR, a packet
is allowed to reduce its offset in any of the d dimensions at any time,
so it may have up to d allowed outputs. With DOR, there is always
exactly 1 allowed output. Thus, the “difference” between DOR and
AR is proportional to d. For d = 1, DOR and AR are identical. The
performances of AR and DOR are expected to differ more for 3D than
they did for 2D. The 3D simulation results are shown in Table 4.

Observation 10 For fixed radix, latency is not directly
proportional to dimension.⁹

Observation 11 For a fixed number of nodes, lower
dimension and larger radix give better performance for heavy
traffic.

Comparing the 3D data to the 2D data shows that the difference
between DOR and AR performance increases with dimension. The
slight latency advantage of AR for light traffic is greater for 3D
networks. The lower throughput and higher latency of AR for heavy
traffic are more pronounced for 3D networks. Path multiplicity is both
the virtue and the vice of AR, and it increases with dimension.

⁷ A packet can be spread over multiple nodes when it cannot fit into
a single FIFO.

⁸ With the FIFO length equal to the packet length, the performance is
the same as for infinite buffering for a 128 × 128 mesh with
A < 80% [5].

⁹ If T were proportional to average distance, it would be
proportional to d.

               DOR (SNS)         AR (ANS)
   R     A     T     AQLEN      T     AQLEN
   4    .1    8.90    --       7.92    --
   4    .3    25.2    --       20.4    --
   4    .5    74.7    --       65.8    --
   4    .6    149     --       ∞       ∞
   4    .7    467     1.74     ∞       ∞
   8    .1    14.1    --       12.0    --
   8    .3    32.8    --       22.9    --
   8    .5    79.1    --       55.8    --
   8    .7    265     --       ∞       ∞
  16    .1    23.2    --       20.6    --
  16    .3    44.4    --       34.0    --
  16    .5    93.2    --       77.5    --
  16    .7    247     --       ∞       ∞
  32    .1    39.8    --       37.3    --
  32    .3    62.8    --       54.3    --
  32    .5    112     --       113     --
  32    .7    250     --       ∞       ∞

(-- : entry missing or illegible in the source)

Table 4: 3D Data

Compare the results for a 4 × 4 × 4 mesh to the results for an 8 × 8
mesh, and compare the results for a 16 × 16 × 16 mesh to the results
for a 64 × 64 mesh. These are the cases in which a 2D mesh and a 3D
mesh have the same number of nodes. The average distance is smaller for
the 3D meshes than for the 2D meshes: 3.75 versus 5.25 and 15.9 versus
42.7. The smaller average distance leads to lower latency for light
traffic. However, the 2D meshes beat the 3D meshes for large applied
load. Output contention is proportional to dimension and inversely
proportional to radix, so 3D meshes have longer queue lengths. Since
latency grows only linearly with average distance but superlinearly
with queue utilization, a low-dimensional mesh will always beat a
high-dimensional mesh for sufficiently heavy traffic.

The 2D meshes exhibit lower injection and cut-through latency
than the 3D meshes without reference to spooling latency. It is well
known that, for fixed wire bisection, a low-dimensional mesh can have
lower latency than a high-dimensional mesh [2, 8]; however, this is a
consequence of the low-dimensional mesh having wider channels, thus
less spooling latency. It was thought that the cut-through latency
increased with radix (decreased with dimension) while the spooling
latency decreased with radix (increased with dimension). The
competition of these two effects formed the basis for computing the
latency-optimal dimension [2, 8]. Since cut-through latency does not
decrease as dimension increases, there is no basis for such an
optimization.

2.3.1  Comparing  Networks  of  Different  Dimension 

The equal-L comparison between 2D meshes (8 × 8, 64 × 64) and 3D
meshes (4 × 4 × 4, 16 × 16 × 16) was biased in favor of the 3D meshes.
Note that the 2D meshes showed better heavy-traffic performance
despite being handicapped in the comparison. However, it is customary
to compare networks of equal wire bisection, since wire bisection gives
a measure of the network's cost/complexity.¹⁰ If two meshes have
the same bisection, then equal A values lead to equal throughput
(message volume). An R × R mesh with channels of width W has bisection
B = RW. An $R^{2/3} \times R^{2/3} \times R^{2/3}$ mesh must have
channel width $W/R^{1/3}$ to have bisection RW. Equal bisection leads
to equal injection rate:

\[ q_{3D} = \frac{4A\,(W/R^{1/3})}{R^{2/3}}\ \frac{\text{bits}}{\text{cycle}}
          = \frac{4AW}{R}\ \frac{\text{bits}}{\text{cycle}} = q_{2D} \]

Since the channels are narrower in the 3D mesh, the packets must be
longer to convey the same information per packet.¹¹ An 8 × 8 mesh
with packet length L=32 should be compared to a 4 × 4 × 4 mesh with
packet length L=64. A 64 × 64 mesh with packet length L=32 should
be compared to a 16 × 16 × 16 mesh with packet length L=128. When a
fair comparison is made, the longer packet lengths for the 3D meshes
further reduce their performance relative to the 2D meshes. Table 5
presents simulations in which $L_{3D} = R^{1/3} L_{2D}$.

¹⁰ If 2D and 3D networks are compared using equal layout area, rather
than equal bisection bandwidth, then the 2D networks look even better.
For a fixed layout area, a 2D network can be given more bisection
bandwidth than a 3D network [13].

With $L_{3D} = L_{2D}$, the 2D meshes gave lower latency only for heavy
traffic. With $L_{3D} = R^{1/3} L_{2D}$, the 2D meshes give lower
latency even for light traffic. Again, the 2D meshes are showing lower
cut-through latency, not just lower total latency. Even if the
cut-through latency were higher, the 2D meshes would have lower total
latency because their wider channels reduce spooling latency. In
addition to having lower latency, the 2D meshes can support higher
throughput (Umax).
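
The packet lengths used for the equal-bisection comparison follow from the channel-width scaling. The worked example below is a sketch that takes R to be the radix of the 2D mesh and assumes the 3D mesh has radix $R^{2/3}$, so that both meshes have $R^2$ nodes.

```latex
% Equal wire bisection fixes the 3D channel width:
\[
  \bigl(R^{2/3}\bigr)^{2}\, W_{3D} \;=\; R\,W
  \quad\Longrightarrow\quad
  W_{3D} \;=\; \frac{W}{R^{1/3}} ,
  \qquad
  L_{3D} \;=\; R^{1/3}\, L_{2D} .
\]
% The two cases simulated in the report:
\[
  R = 8:\;\; L_{3D} = 2 \cdot 32 = 64 \ \text{flits}
  \qquad\qquad
  R = 64:\;\; L_{3D} = 4 \cdot 32 = 128 \ \text{flits}
\]
```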


3  Summary  of  DOR  Advantages 

1.  Throughput:  DOR  can  support  higher  throughput  than  AR 
on  the  same  network:  Umax  is  greater  for  DOR  than  for  AR. 

2. Latency: DOR gives significantly lower latency than AR for
heavy traffic. It is for heavy traffic (communication-limited
computations) that network performance is critical. The difference
in latency is small for light traffic.

3.  Simplicity:  VLSI  implementation  of  DOR  [9]  is  considerably 
simpler  than  implementation  of  AR  [4]. 

(a) Area: Dimension-order routers are smaller than adaptive
routers and require only a tiny fraction (≈ 3%) of the area
of a single-chip multicomputer node [6].

11 Rather than making the packets longer to compensate for narrower channels, more
packets could be used to send each message. However, increasing the number of packets
increases the overhead due to packet headers, which increases the applied load, which
increases latency.

      Parameters            DOR (SNS)          AR (ANS)
  d     R     L     A       T     AQLEN       T     AQLEN
  2     8    32     ?     9.79    0.00        ?     0.00
  3     4    64     ?     13.2    0.01        ?     0.01
  2     8    32    .3     21.6      ?       18.8      ?
  3     4    64    .3     46.2      ?       34.8      ?
  2     8    32    .5     53.3      ?       44.1      ?
  3     4    64    .5     144       ?       123       ?
  2     8    32    .7     179       ?       256     0.92
  3     4    64    .7     892       ?        ∞        ∞
  2    64    32     ?     48.6    0.00      48.2    0.00
  3    16   128     ?     41.9    0.00      31.3    0.00
  2    64    32    .3     63.9    0.01      63.4    0.01
  3    16   128    .3     127     0.01      83.2    0.01
  2    64    32    .5     95.5      ?       100       ?
  3    16   128    .5     316       ?       253       ?
  2    64    32    .7     182       ?       253     0.07
  3    16   128    .7     919       ?        ∞        ∞

Table 5: 2D vs 3D with Equal Bisections

(b) Speed: DOR implementations are smaller and simpler than
AR implementations, so they are faster. The simulation
results underestimate the performance advantage of DOR
because they assume equal cycle times for DOR and AR.

(c) Packet-Order: With multipath routing (like AR), packets
between a given source and destination may take different
paths through the network and arrive out of order; this
introduces the non-trivial problem of reconstructing packet
order at the destination (and reduces performance). With
single-path routing (like DOR), packet-order is always
preserved.

(d) Design: Simpler routers are easier to design, so for a given
amount of design effort, more attention can be paid to
optimizing the design.

4. Scaling: The performance advantage of DOR over AR increases
with radix. In particular, DOR beats AR for every applied load
in a 128 × 128 mesh.


4  Future  Work 

The data presented in this report shows more than the preferability
of DOR to AR; the data also exposes several misconceptions about
routing network behavior and performance. The data shows that average
cut-through latency is neither directly proportional to nor even a
monotone function of average distance. The data shows that DOR performs
better than previously realized; indeed, so well that there is no
reason to consider a more complicated algorithm. The misunderstandings
that were incidentally exposed by this adaptive-routing research
impact other aspects of network design. For example, it is clear that
the “optimal” dimension for a routing network is not determined by
minimizing the expression for congestion-free latency [8]. Networks of
different dimensionality and equal cost do not have the same throughput,
so the dimension should be chosen to maximize throughput rather
than minimize latency. The congestion-free latency formula is both
quantitatively and qualitatively incorrect, even for light traffic. A future
report will show that 2D networks are throughput-optimal for
fixed area [13].

Another future report will elucidate the mechanism responsible for
the superior performance of DOR. It is easy to understand why adaptive
routing does not outperform dimension-order routing. Since there
is only $O(1/R)$ contention for outputs with DOR, there is little opportunity
for AR to reduce contention. However, this does not explain
why DOR performs better than adaptive routing. The reason why
adaptive routing does not perform as well as dimension-order routing
is that AR does not utilize all the bisection channels evenly. Simulation
results proving this explanation of AR's inferior performance will
be presented in [14]. It is important to remember that, even though
a different AR implementation might perform as well as DOR, AR
cannot outperform DOR.


A  Listing of Simulator Code

/* sns4.c - Improved convergence detection.
 */
#include <stdio.h>
#include <malloc.h>
double drand48();

#define CHECK(c,m) {if(!(c)){printf("ERROR: %s\n",m); exit(7);}}
#define PBY(p) (drand48()<(p))
#define MAX(a,b) (((a)>(b))?(a):(b))
#define ABS(x) (((x)<0)?-(x):(x))
#define DIFF(old,new) ABS(((new)-(old))/(old))
int pwr(x,y) int x,y; {int r=1; for(;y>0;y--) r*=x; return r;}


/* Parameters
 */
#ifndef N
#define N        16384                    /* number of nodes */
#define R        128                      /* radix */
#define d        2                        /* dimension */
#endif
#ifndef A
#define A        .5                       /* applied load */
#endif
#define L        32                       /* packet length in flits */
#define ACCURACY .03
#define TOL      (ACCURACY/3.)
#define MINPKT   (1./(TOL*TOL))
int INTERVAL = (int)(MINPKT*L/(8*A*N/R));
#define B        (N/R)                    /* bisection BW in flits/cycle */
#define NIN      (2*d+1)                  /* number of router inputs */
#define NOUT     (2*d+1)                  /* number of router outputs */
#define NUMQ     (N*NIN-d*B)              /* number of active queues */
#define AQLEN    ((numsent-numrecd)/NUMQ)

/* Measurements
 */
double numsent=0.;  /* number of packets sent */
double numrecd=0.;  /* number of packets received */
double totlat=0.;   /* sum of received-packet latencies */
double tothops=0.;  /* sum of received-packet distances */

#define T  (totlat/numrecd)       /* average TOTAL latency */
#define TP ((numrecd*L)/curtime)  /* throughput in flits/cycle */
#define U  (.25*TP/B)             /* bisection utilization */
#define D  (tothops/numrecd)      /* average distance */

/* Data Structures
 */
typedef struct packet packet;
struct packet{int dest,tsent,tin,nhops; packet *next;};
typedef struct nodestate nodestate;
struct nodestate
{packet *head[NIN], *tail[NIN]; int tnh[NIN],tfree[NOUT],tls,pin,pout;};

/* Numbering Conventions:
 * Dimensions numbered from 0: x=0, y=1, z=2, etc...
 * Channels numbered: local=0, xpred=1, xsucc=2, ypred=3, ...
 * Node (x,y,z,...) numbered: ... + zR^2 + yR + x
 */
#define DIMOF(in)    (((in)-1)/2)
#define PRED(dim)    (2*(dim)+1)
#define SUCC(dim)    (2*(dim)+2)
#define END(out)     (((out)%2)?((out)+1):((out)-1))
#define COORD(n,dim) (((n)/pwr(R,dim))%R)
#define NGHBR(n,o)   (((o)%2)?(n)-pwr(R,DIMOF(o)):(n)+pwr(R,DIMOF(o)))

/* Simulator
 */
int curtime=0;      /* current simulation time */
nodestate node[N];  /* simulator state */

initnode(n) nodestate *n; {
    int i; n->pin=n->pout=n->tls=0;
    for(i=0; i<NIN; i++)
        {n->head[i]=n->tail[i]=(packet*)0; n->tnh[i]=n->tfree[i]=0;}
}

init(){ int n; for(n=0; n<N; n++) initnode(&(node[n])); }

main() {
    int n,etime=INTERVAL; double oldT,curT;
    init(); printf("\n\n"); nice(10);
    printf("SNS4, N=%d, R=%d, d=%d, L=%d, A=%g, ACCURACY=%g\n",
           N,R,d,L,A,ACCURACY); fflush(stdout);
    printf("Wait for throughput to converge.\n");
    do{ for( ; curtime<etime; curtime++)
            for(n=0; n<N; n++) simulate(n);
        printf("curtime=%d\n",curtime);
        printf("numsent=%g, numrecd=%g\n",numsent,numrecd);
        printf("AQLEN=%g/%d=%g\n",numsent-numrecd,NUMQ,AQLEN);
        printf("D=%g, d*(R-1/R)/3.=%g\n",D,d*(R-1./R)/3.);
        printf("T=%g\n",T);
        printf("U=%g, A=%g, DIFF(A,U)=%g\n",U,A,DIFF(A,U));
        printf("\n"); fflush(stdout);
        curT=T; etime*=2;
    } while(DIFF(A,U)>TOL);
    printf("Wait for latency to converge\n");
    do{ oldT=curT;
        for( ; curtime<etime; curtime++)
            for(n=0; n<N; n++) simulate(n);
        printf("curtime=%d\n",curtime);
        printf("numsent=%g, numrecd=%g\n",numsent,numrecd);
        printf("AQLEN=%g/%d=%g\n",numsent-numrecd,NUMQ,AQLEN);
        printf("D=%g, d*(R-1/R)/3.=%g\n",D,d*(R-1./R)/3.);
        printf("U=%g\n",U); printf("T=%g\n",T);
        printf("DIFF(oldT,T)=%g, TOL=%g\n",DIFF(oldT,T),TOL);
        printf("\n"); fflush(stdout);
        curT=T; etime*=2;
    } while(DIFF(oldT,curT)>TOL); return 0;
}


/* Node Behavior
 */
#define PIN  node[n].pin
#define POUT node[n].pout

simulate(n) int n; {
    int in,out;
    if(PBY(4.*A/(R*L))) inject(n); findpin(n);
    for(in=0; in<NIN; in++) for(out=0; out<NOUT; out++)
        if(allowed(n,(PIN+in)%NIN,(POUT+out)%NOUT))
        {
            forward(n,(PIN+in)%NIN,(POUT+out)%NOUT);
            if(!in) {PIN=(PIN+1)%NIN; findpin(n);}
            if(!out) POUT=(POUT+1)%NOUT;
        }
}

findpin(n) int n; {
    int i; packet *p; nodestate *nd = &(node[n]);
    for(i=0; i<NIN; i++)
        if((p=nd->head[PIN])&&(MAX(p->tin,nd->tnh[PIN])<=curtime))
            return;
        else PIN=(PIN+1)%NIN;
}

inject(n) int n; {
    packet *p=(packet*)malloc((unsigned)sizeof(packet));
    node[n].tls = p->tin = p->tsent = MAX(curtime,node[n].tls+L);
    p->dest = drand48()*N; p->nhops=0; p->next=0;
    enqueue(p,n,0); numsent++;
}

int allowed(n,in,out) int n,in,out; {
    int dim,nc,pc; packet *p=node[n].head[in];
    if(!p || p->tin>curtime) return 0;        /* p arrived? */
    if(node[n].tnh[in]>curtime) return 0;     /* p at head? */
    if(node[n].tfree[out]>curtime) return 0;  /* out free?  */
    for(dim=0; dim<d; dim++) {
        nc=COORD(n,dim); pc=COORD(p->dest,dim);
        if(nc<pc) return out==SUCC(dim);
        else if(nc>pc) return out==PRED(dim);
    }
    CHECK(p->dest==n,"profitable()"); return out==0;
}

forward(n,in,out) int n,in,out; {
    packet *dequeue(); packet *p=dequeue(n,in); p->tin=curtime+1;
    node[n].tnh[in]=node[n].tfree[out]=curtime+L;
    if(out==0) deject(p);
    else p->nhops++, enqueue(p,NGHBR(n,out),END(out));
}

deject(p) packet *p; {
    numrecd++; totlat+=p->tin-p->tsent; tothops+=p->nhops;
    free((char*)p);
}

/* Misc Functions
 */
packet *dequeue(n,in) int n,in; {
    packet *p=node[n].head[in]; node[n].head[in]=p->next;
    if(p==node[n].tail[in]) node[n].tail[in]=(packet*)0;
    p->next=(packet*)0; return p;
}

enqueue(p,n,in) packet *p; int n,in; {
    nodestate *nd=&(node[n]);
    if(nd->head[in]) nd->tail[in]->next=p; else nd->head[in]=p;
    nd->tail[in]=p; p->next=0;
}
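
The routing decision embedded in allowed() can be isolated as a small stand-alone function. The following is a minimal sketch in ANSI C rather than the K&R style of the listing; the names `RADIX`, `DIMS`, `ipow`, `coord`, and `dor_output` are hypothetical, and the 8 × 8 mesh size is an assumption chosen only to keep the example small. It follows the numbering conventions of the listing (local=0, predecessor of dimension k is 2k+1, successor is 2k+2) and corrects the lowest-numbered dimension first.

```c
#include <assert.h>

/* Assumed example size: an 8 x 8 mesh (not one of the report's parameters). */
enum { RADIX = 8, DIMS = 2 };

/* Integer power, playing the role of pwr() in the listing. */
static int ipow(int x, int y)
{
    int r = 1;
    while (y-- > 0) r *= x;
    return r;
}

/* Coordinate of node n along dimension dim (cf. COORD in the listing). */
static int coord(int n, int dim)
{
    return (n / ipow(RADIX, dim)) % RADIX;
}

/* Output channel that dimension-order routing selects at node n for a
 * packet bound for dest: the first dimension whose coordinate differs
 * decides the output; if none differs, the packet is delivered locally. */
static int dor_output(int n, int dest)
{
    int dim;
    assert(n >= 0 && dest >= 0);
    for (dim = 0; dim < DIMS; dim++) {
        int nc = coord(n, dim), pc = coord(dest, dim);
        if (nc < pc) return 2 * dim + 2;   /* successor channel   */
        if (nc > pc) return 2 * dim + 1;   /* predecessor channel */
    }
    return 0;                              /* local output        */
}
```

For example, a packet from node 0 = (0,0) to node 9 = (1,1) leaves node 0 on xsucc (output 2), leaves node 1 on ysucc (output 4), and is delivered at node 9 (output 0). Because the output depends only on the current node and the destination, every packet between a given pair of nodes follows the same path.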


A.1  Correction


Line 33 of the listing is:

#define NUMQ (N*NIN-d*B)   /* num. active queues */

but it should be:

#define NUMQ (N*NIN-2*d*B) /* num. active queues */

The simulation results were obtained with the listed code. The missing
“2” leads to a small systematic error in the AQLEN values. AQLEN values
are reported only for insight; the network average queue length at
termination is not an adequate substitute for the time average. The
small systematic error is insignificant given the rough nature of the
AQLEN values. The true network-average queue-length can be obtained
by multiplying AQLEN by $C = \frac{2d+1-d/R}{2d+1-2d/R}$. The correction factors are
tabulated below.


R =        4      8     16     32     64    128
d = 1    1.10   1.05   1.02   1.01   1.01   1.00
d = 2    1.13   1.06   1.03   1.01   1.01   1.00
d = 3    1.14   1.06   1.03   1.01   1.01   1.00
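
The tabulated factors follow from the ratio of the listed queue count, N*NIN-d*B, to the corrected one, N*NIN-2*d*B, after dividing both by N and using NIN = 2d+1 and B = N/R. A minimal sketch of that calculation is given here; the function name `aqlen_correction` is hypothetical and does not appear in the simulator.

```c
#include <assert.h>

/* Ratio of the listed queue count to the corrected queue count,
 * (2d+1 - d/R) / (2d+1 - 2d/R), for a d-dimensional mesh of radix R.
 * Multiplying a reported AQLEN by this factor gives the corrected value. */
static double aqlen_correction(int d, int R)
{
    assert(d > 0 && R > 1);
    return (2.0 * d + 1.0 - (double)d / R)
         / (2.0 * d + 1.0 - 2.0 * (double)d / R);
}
```

For d = 2 and R = 4 the function returns 4.5/4 = 1.125, which appears as 1.13 in the table; the factor approaches 1 as R grows.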


B  Questions  and  Answers

When a draft of this report was circulated, it elicited some questions
and requests for more detail. It was felt that these questions would
not have arisen if the conclusions of this report did not challenge some
well-entrenched misconceptions. Rather than add digressions to the
body of the report, it was decided to add an appendix that specifically
addressed the concerns of AR proponents. A reader with no
predisposition in favor of adaptive routing should find the main body
of the report sufficiently convincing, and need not be bothered with
fussy details. Readers inclined to find fault with the report because of
the nature of its conclusions will hopefully find satisfactory answers
to most of their questions below.

Question  1  What  traffic  pattern  was  used  in  the  simulations,  and 
what  justification  is  there  for  that  choice? 

Answer 1 Random, homogeneous traffic was used for the simulations.
During each cycle, every node injects a packet with probability
$\frac{4A}{RL}$, and all destinations are equally likely. For further details, the
reader is referred to the report describing the simulator [5]. Studies of
routing networks use random traffic for at least two reasons:

1. Random traffic is the worst case. Performance is better for
localized traffic.

2.  Random  traffic  is  the  most  general.  Special-case  traffic  patterns 
are  of  limited  interest. 

In particular, the previous reports comparing AR to DOR [1] used
random traffic, so this report must use random traffic to refute those
results. Indeed, random traffic favors AR because localized traffic
would lead to shorter path lengths and less path multiplicity; without
path multiplicity, “adaptive” routing is impossible. For localized traffic,
there would be no reason to consider AR. It might be possible to
concoct a non-localized pathological traffic pattern that favored AR, but
such chimeras are not relevant to the practical engineering of
general-purpose routing networks. Random traffic is the worst-case model for
non-localized communication, since any pattern can be randomized.
DOR is at least as good as AR for localized communication, which
lacks path multiplicity. DOR beats AR for non-localized communication
because it beats AR for random traffic.
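
The injection probability used by the simulator (the listing calls PBY(4.*A/(R*L)) once per node per cycle) can be related to the applied load with a short calculation. The sketch below is illustrative only; the function names `injection_probability` and `offered_flits_per_cycle` are hypothetical. It shows that each node offers 4A/R flits per cycle, which is consistent with the listing's definition of utilization, U = .25*TP/B with B = N/R.

```c
#include <assert.h>

/* Probability that a node injects a packet in a given cycle, for applied
 * load A, radix R, and packet length L (in flits): 4A/(RL). */
static double injection_probability(double A, int R, int L)
{
    assert(R > 0 && L > 0);
    return 4.0 * A / ((double)R * (double)L);
}

/* Average number of flits a node offers per cycle: each injected packet
 * carries L flits, so the packet length cancels and 4A/R remains. */
static double offered_flits_per_cycle(double A, int R, int L)
{
    return injection_probability(A, R, L) * (double)L;
}
```

With the default parameters of the listing (A = .5, R = 128, L = 32), a node injects a packet with probability 2/4096 per cycle and offers 1/64 of a flit per cycle on average.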




Question  2  What  effect  would  finite  buffering  have  on  the  results? 
Are  infinite-buffering  results  realistic? 

Answer 2 Finite buffering would increase the performance advantage
of DOR, since infinite buffering favors AR. Infinite buffering
eliminates AR's deadlock problem. The techniques (e.g., misrouting [1])
required to avoid deadlock when using AR with finite buffering would
only decrease the performance of AR. Moreover, as can be seen from
the AQLEN values, AR needs more buffering than DOR (under heavy
traffic), so restricting the buffering would only further decrease AR’s
performance relative to DOR.

The  AQLEN  values  also  indicate  that  the  infinite-buffering  results 
are  completely  realistic  because  the  queue  lengths  are  typically  very, 
very  short.  Indeed,  a  DOR  simulator  with  FIFOs  only  one  packet 
long  gives  the  same  results  as  a  simulator  with  unbounded  FIFOs  for 
practical  topologies  [5].  In  short,  finite  buffering  would  leave  DOR 
results  unchanged  and  worsen  the  AR  results. 


Question  3  What  would  be  the  effect  of  non-uniform  traffic  where 
there  was  a  concentration  of  traffic  to/from  a  subset  of  the  nodes? 

Answer 3 This question is a red herring, as explained in the
Introduction. The most general rebuttal to this question is that DOR is the
extant  routing  technique,  so  the  burden  of  proof  lies  with  the  proponent 
of  an  unproven  alternative.  If  there  is  a  non-uniform  traffic  model 
that  is  sufficiently  common  to  be  of  interest,  then  the  AR  proponent 
is  obliged  to  demonstrate  significantly  superior  performance  for  AR 
under  that  traffic  model.  In  fact,  no  such  non-uniform  traffic  model 
has  been  proposed.  Indeed,  a  chronically  non-uniform  traffic  pattern  is 
liable  to  arise  only  if  either  the  program  design  or  process  placement 
is  poor.  Dynamic  hot  spots  are  handled  better  by  DOR  than  by  AR. 
DOR  dissipates  traffic  fluctuations  more  quickly  because  it  has  lower 
latency  and  higher  throughput.  The  technical  rebuttal  to  this  question 
is  that  random  traffic  is  the  accepted  standard  for  network  analysis, 
so  there  is  no  obligation  to  consider  every  special  case  that  might  be 
concocted. 


Question 4 How does ANS choose an output when several are
allowed, and how does this choice affect performance?

Answer 4 As described in the simulator report [5], ANS chooses the
first free allowed output, and the choice does not affect performance
significantly. If several profitable outputs are free, advancing the packet
in the dimension with the largest offset would preserve the most path
multiplicity, but previous studies have shown no significant effect from
the details of output selection [1]. Even if the output selection had a
noticeable impact on path multiplicity, it would not affect the
conclusions of this report. If path multiplicity were decreased, there would
be less difference between AR and DOR (as in 1D); if AR is no better
than DOR, then there is no justification for its greater cost/complexity.
If path multiplicity were increased, the differences between AR and
DOR would be amplified (as in 3D), in which case the advantages of
DOR would be magnified.


C  Effect  of  Packet  Length 

The simulations presented in this report have used a packet length
of L=32, which is realistic for the Mosaic and appropriate for
comparing AR to DOR. However, it is desirable to study the effects of
every parameter (all other parameters have been varied over a wide
range), so this appendix will present some simulations using different
packet lengths. It would be very time-consuming to repeat all of the
previous simulations using different packet lengths; the simulations
for this report have used > 3000 hours of CPU time on a network of
≈ 30 Sun SPARCstations. Two-dimensional meshes are the topology
of interest, so this section will present data for 8 × 8, 32 × 32, and
128 × 128 meshes. Short packets (L=8) and long packets (L=128) will
be compared with the medium packets (L=32) used so far. The data
is presented in Table 6.

Observation  12  The  latency  difference  between  AR  and 
DOR  generally  increases  with  packet  length. 

Observation  13  Cut-through  latency  generally  increases 
with  packet  length. 

The performance difference between AR and DOR is amplified by
packet length. If AR latency is slightly lower than DOR latency, then
this difference increases with L. If AR latency is greater than DOR
latency, then this difference also increases with L. The L=32 value
used in this report is slightly higher than the L ≈ 20 estimated for
fine-grained computations [8]. A shorter packet length would increase
the advantage of DOR over AR.

Table 6: Effect of L

Cut-through latency increases with packet length. Spooling latency
also increases with packet length: $T_{spool} = L$. The increase
in latency with packet length suggests using short packet lengths.
However, short packets have relatively more overhead. For example,
if two flits are required for a packet header, and there are P
flits of payload per message, then $n = \frac{P}{L-2}$ packets are required for
each message, and there are 2n overhead flits per message. In this
case, the applied load would increase as the packet length decreased:
$A \propto 1 + \frac{2n}{P} = 1 + \frac{2}{L-2}$. Thus, decreasing the packet length increases the
applied load and thereby increases latency. The competition between
latency increasing with packet length and applied load decreasing with
packet length implies an optimal packet length. Determining the
optimal packet length for interesting topologies might be an interesting
topic for future study.
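
The header-overhead argument can be made concrete with a short calculation. This is a minimal sketch under a stated assumption: each packet of L flits carries L minus the header length in payload flits, so a message with P payload flits needs P/(L-2) packets when the header is two flits. The function name `overhead_factor` is hypothetical, and the sketch models only the growth in applied load, not the resulting latency.

```c
#include <assert.h>

/* Factor by which packet headers inflate the applied load, relative to
 * sending payload alone: 1 + header/(L - header).  L is the packet
 * length in flits and header is the number of header flits per packet. */
static double overhead_factor(int L, int header)
{
    assert(header >= 0 && L > header);
    return 1.0 + (double)header / (double)(L - header);
}
```

With a two-flit header, the factor is about 1.33 for L=8, about 1.07 for L=32, and about 1.02 for L=128, so the short packets studied in this appendix pay a noticeably larger overhead than the long ones.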